Because both sellers and buyers rely heavily on price predictions, an accurate prediction model is essential for the housing market. Zillow is an American real estate company with a large database of homes for sale and rent. However, because its data span the entire United States, it is difficult for Zillow to make precise housing price predictions for a specific city. The purpose of this project is to help Zillow build a prediction model of housing prices in Philadelphia based on local characteristics. The model would give sellers and buyers a strong guide for their decisions so that they do not substantially overpay. It would also reduce credibility risks and enhance market efficiency.
The overall modeling strategy for the project is multivariable Ordinary Least Squares (OLS) regression, a widely used statistical method for examining the relationship between a dependent variable and the predictors that may affect its value. The method provides the strength and direction of each relationship as well as the goodness of the model fit. In our case, there are several challenges to overcome. First, we need to select the most significant and diverse variables for the model and find data from reliable sources. Second, we need to provide strong evidence and clear visualizations, both plots and maps, with comprehensive interpretations for the audience. Third, we should state the limitations of the OLS regression results: unpredictable real-world factors, such as the COVID-19 pandemic, can still affect housing prices in ways the model cannot capture.
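As a minimal, self-contained sketch of how multivariable OLS recovers these relationships, consider the toy example below. The data are simulated and the variable names are hypothetical; this is only an illustration of the `lm()` workflow used later in the report.

```r
# Simulate a toy housing dataset: price depends positively on living area
# and negatively on house age, plus noise.
set.seed(1)
n     <- 200
area  <- runif(n, 800, 3000)   # living area (sq ft)
age   <- runif(n, 0, 120)      # house age (years)
price <- 50000 + 150 * area - 800 * age + rnorm(n, sd = 20000)

# Multivariable OLS: one dependent variable, several predictors
fit <- lm(price ~ area + age)

coef(fit)                  # direction and strength of each predictor
summary(fit)$r.squared     # goodness of fit
```

The fitted coefficients recover the simulated signs (positive for area, negative for age), which is exactly the kind of direction-and-strength reading we draw from the full model below.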
The results of our model indicate that using local data can enhance the accuracy of housing price predictions for Philadelphia.
To decide which specific variables might affect housing prices, we started from three categories: interior conditions, neighborhood environment, and demographic characteristics. According to Zillow’s website, living area, house age, and utilities are listed in the overview section of each listing, which suggests they are among the factors buyers consider most. Therefore, we chose the columns “total_area”, “year_built”, and “quality_grade” in the original dataset to represent these factors. For the environmental factors, people typically consider safety and convenient access to public resources, so we explored the OpenDataPhilly website and used data on shooting victims, commercial corridors, and schools. For the demographic characteristics, we retrieved 2021 census data for Philadelphia from the U.S. Census Bureau, covering median household income, the percentage of foreign-born residents, and geographic mobility (moves to a different house).
# Load Libraries
library(geos)
library(rsgeo)
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(knitr)
library(dplyr)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(stargazer)
options(scipen=999)
options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
Since our project is heavily focused on geographic features, we retrieved the block group and neighborhood boundaries from OpenDataPhilly, and we filtered the variables we need from the raw housing price data.
census_api_key("6a81f5ae68fcb8e26d2f0a80f4232c5e503b553d", overwrite = TRUE)
Philly_block_groups <-
st_read("https://opendata.arcgis.com/datasets/2f982bada233478ea0100528227febce_0.geojson") %>%
st_transform('ESRI:102728')
neighborhood <- st_read("/Users/rachelren/Desktop/Upenn/MUSA_5080/Midterm/MUSA508_Midterm_Li_Ren/Data/Neighborhoods_Philadelphia.geojson") %>%
st_transform('ESRI:102728')
Philly_price_all <-
st_read("/Users/rachelren/Desktop/Upenn/MUSA_5080/Midterm/MUSA508_Midterm_Li_Ren/Data/studentData.geojson") %>%
st_transform('ESRI:102728')
Philly_price_clean <- Philly_price_all[, c("objectid", "census_tract", "total_area", "quality_grade", "sale_price", "year_built", "geometry", "toPredict")] ## Select columns we use
## Calculate the age of the house
Philly_price_clean <-
Philly_price_clean %>%
mutate(houseAge = ifelse((year_built > 0 & year_built < 2023), 2023-year_built, 0))
## Data of the housing nearby environments
Shooting_victims <-
st_read("/Users/rachelren/Desktop/Upenn/MUSA_5080/Midterm/MUSA508_Midterm_Li_Ren/Data/shootings.geojson") %>%
st_transform('ESRI:102728')
PPR_Sites <-
st_read("/Users/rachelren/Desktop/Upenn/MUSA_5080/Midterm/MUSA508_Midterm_Li_Ren/Data/PPR_Program_Sites.geojson") %>%
st_transform('ESRI:102728')
Commercial_Corridors <-
st_read("/Users/rachelren/Desktop/Upenn/MUSA_5080/Midterm/MUSA508_Midterm_Li_Ren/Data/Commercial_Corridors.geojson") %>%
st_transform('ESRI:102728')
schools <-
read.csv("/Users/rachelren/Desktop/Upenn/MUSA_5080/Midterm/MUSA508_Midterm_Li_Ren/Data/Schools.csv")
schools.sf <-
schools %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102728')
##Get and clean the variables in the 2021 Census data
## all of the variables are in block group
blockgroup21 <-
get_acs(geography = "block group",
variables = c("B19013_001E","B99051_005E",
"B01003_001E","B07201_003E",
"B07201_001E"),
year=2021, state=42,
county=101, geometry=TRUE) %>%
st_transform('ESRI:102728')
variables2021 <- load_variables(2021,'acs5')
variables2021_blockGroup <- load_variables(2021,'acs5') %>%
dplyr::filter(geography == 'block group')
variables2021_tract <- load_variables(2021,'acs5') %>%
dplyr::filter(geography == 'tract')
blockgroup21 <-
blockgroup21 %>%
dplyr::select( -NAME, -moe) %>%
spread(key = variable, value = estimate) %>%
rename(TotalPop = B01003_001,
ForeignBorn = B99051_005,
MedHHInc = B19013_001,
MobilityDifferentHouse = B07201_003,
MobilityTotal = B07201_001
)
## Calculate the percentage of foreign born and the mobility rate
blockgroup21 <-
blockgroup21 %>%
mutate(pctForeign = ifelse(TotalPop > 0, ForeignBorn / TotalPop, 0),
pctMobility = ifelse(MobilityTotal > 0, (MobilityDifferentHouse / MobilityTotal), 0),
year = "2021") %>%
dplyr::select(-ForeignBorn,-MobilityDifferentHouse,-MobilityTotal,-TotalPop)
To better capture the environmental impact on housing prices, we use two different approaches. For the shooting data, we counted the victims shot in the head within 1,000 feet of each house, because head wounds are the most likely to be fatal. Also, because the location and time of a crime are effectively random, a fixed buffer is an appropriate way to measure safety at the community level: even a fatal crime at some distance can cause panic and affect housing prices. We therefore sum crimes within a buffer rather than using the average nearest neighbor distance.
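The buffer-count idea above can be sketched in a few lines of base R: for each house, count the incident points whose planar distance is within the radius. The coordinates below are simulated (in feet) purely for illustration; the actual analysis uses sf buffers, as shown next.

```r
# Simulated planar coordinates (feet) for a handful of houses and incidents
set.seed(2)
houses    <- data.frame(x = runif(5, 0, 5000),  y = runif(5, 0, 5000))
incidents <- data.frame(x = runif(50, 0, 5000), y = runif(50, 0, 5000))

# Count incident points within `radius` feet of one house
count_within <- function(hx, hy, pts, radius = 1000) {
  sum(sqrt((pts$x - hx)^2 + (pts$y - hy)^2) <= radius)
}

houses$n_incidents <- mapply(count_within, houses$x, houses$y,
                             MoreArgs = list(pts = incidents))
houses$n_incidents
```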
Shooting_victims %>%
group_by(wound) %>%
summarize(count = n()) %>%
arrange(-count)
PhillyShootingHead.sf <-
Shooting_victims %>%
filter(wound == "Head",
point_y > -1) %>%
st_drop_geometry() %>% ## drop the existing geometry so it can be rebuilt from the coordinate columns
dplyr::select(point_x, point_y) %>%
na.omit() %>%
st_as_sf(coords = c("point_x", "point_y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102728') %>%
distinct()
Philly_price_clean$shooting.Buffer <- Philly_price_clean %>%
st_buffer(1000) %>%
aggregate(mutate(PhillyShootingHead.sf, counter = 1),., sum) %>%
pull(counter)
The second approach applies to schools, commercial corridors, and public parks and recreation centers. Here we calculated Euclidean distances, because for these public resources the distance itself is what matters most: it represents how convenient it is to live in a given house. The variables used in the further analysis are the distances from each house to its nearest public amenity.
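The nearest-amenity distance can likewise be sketched in base R: for each house, take the minimum Euclidean distance over all amenity points. Again, the coordinates are simulated for illustration; the project code uses `sf::st_nearest_feature` together with `rsgeo`, as shown below.

```r
# Simulated planar coordinates (feet) for houses and amenities
set.seed(3)
houses    <- data.frame(x = runif(4, 0, 5000),  y = runif(4, 0, 5000))
amenities <- data.frame(x = runif(10, 0, 5000), y = runif(10, 0, 5000))

# Euclidean distance from one house to its closest amenity
nearest_dist <- function(hx, hy, pts) {
  min(sqrt((pts$x - hx)^2 + (pts$y - hy)^2))
}

houses$dist_to_amenity <- mapply(nearest_dist, houses$x, houses$y,
                                 MoreArgs = list(pts = amenities))
houses$dist_to_amenity
```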
nearest_school <- sf::st_nearest_feature(Philly_price_clean, schools.sf)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(schools.sf)
Philly_price_clean$dist_to_school <- rsgeo::distance_euclidean_pairwise(x, y[nearest_school])
nearest_Commercial <- sf::st_nearest_feature(Philly_price_clean, Commercial_Corridors)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(Commercial_Corridors)
Philly_price_clean$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_Commercial])
nearest_PPR <- sf::st_nearest_feature(Philly_price_clean, PPR_Sites)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(PPR_Sites)
Philly_price_clean$dist_to_PPR <- rsgeo::distance_euclidean_pairwise(x, y[nearest_PPR])
For the census variables, we simply joined the data of each block group to the houses within that block group. We also assigned each house its neighborhood for future use.
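Conceptually this is a point-in-polygon join. The base-R sketch below uses axis-aligned rectangles as stand-in "block groups" (the GEOIDs and coordinates are made up); real block groups are arbitrary polygons, which `sf::st_join` with `st_within` handles for us.

```r
# Toy houses and rectangular "block groups" (hypothetical GEOIDs)
houses <- data.frame(id = 1:3, x = c(1, 4, 9), y = c(1, 4, 9))
blocks <- data.frame(GEOID = c("A", "B"),
                     xmin = c(0, 3), xmax = c(2, 6),
                     ymin = c(0, 3), ymax = c(2, 6))

# Return the GEOID of the rectangle containing the point, or NA if none
match_block <- function(px, py) {
  hit <- which(px >= blocks$xmin & px <= blocks$xmax &
               py >= blocks$ymin & py <= blocks$ymax)
  if (length(hit) == 1) blocks$GEOID[hit] else NA_character_
}

houses$GEOID <- mapply(match_block, houses$x, houses$y)
houses$GEOID  # "A" "B" NA
```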
Philly_Housing_joined <- st_join(Philly_price_clean, blockgroup21, join = st_within)
Philly_Housing_joined <- st_join(Philly_Housing_joined, neighborhood, join = st_within)
numeric_columns <- sapply(Philly_Housing_joined, is.numeric)
for (col in names(Philly_Housing_joined)[numeric_columns]) {
col_mean <- mean(Philly_Housing_joined[[col]], na.rm = TRUE)
Philly_Housing_joined[[col]][is.na(Philly_Housing_joined[[col]])] <- col_mean
}
Philly_price_Model <- Philly_Housing_joined %>%
filter(toPredict == "MODELLING") ## select data containing the sales price
summary_fields <- Philly_price_Model[c("total_area","houseAge", "MedHHInc", "pctForeign", "pctMobility", "shooting.Buffer", "dist_to_school", "dist_to_commerce", "dist_to_PPR")]
stargazer(as.data.frame(summary_fields),
type = "html",
title = "Table 2.1: Summary Statistics",
##add.lines = list("Description" = descriptions),
digits = 1,
column.sep.width = "1cm",
flip = TRUE)
##
## <table style="text-align:center"><caption><strong>Table 2.1: Summary Statistics</strong></caption>
## <tr><td colspan="10" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>total_area</td><td>houseAge</td><td>MedHHInc</td><td>pctForeign</td><td>pctMobility</td><td>shooting.Buffer</td><td>dist_to_school</td><td>dist_to_commerce</td><td>dist_to_PPR</td></tr>
## <tr><td colspan="10" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">N</td><td>23,781</td><td>23,781</td><td>23,781</td><td>23,781</td><td>23,781</td><td>23,781</td><td>23,781</td><td>23,781</td><td>23,781</td></tr>
## <tr><td style="text-align:left">Mean</td><td>1,828.9</td><td>85.4</td><td>65,569.4</td><td>0.1</td><td>0.1</td><td>3.4</td><td>1,147.6</td><td>670.4</td><td>1,796.8</td></tr>
## <tr><td style="text-align:left">St. Dev.</td><td>3,818.1</td><td>29.7</td><td>31,760.0</td><td>0.1</td><td>0.1</td><td>2.5</td><td>647.6</td><td>639.0</td><td>1,032.3</td></tr>
## <tr><td style="text-align:left">Min</td><td>0.0</td><td>0</td><td>6,302.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>32.4</td><td>0.0</td><td>79.9</td></tr>
## <tr><td style="text-align:left">Max</td><td>226,512.0</td><td>273</td><td>250,001.0</td><td>0.8</td><td>0.8</td><td>32.0</td><td>5,900.8</td><td>7,385.7</td><td>10,317.7</td></tr>
## <tr><td colspan="10" style="border-bottom: 1px solid black"></td></tr></table>
numericVars <-
select_if(st_drop_geometry(Philly_price_Model), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables")
ggplot(Philly_price_Model, aes(x = MedHHInc, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Price as a function of median household income",
x = "Median Household Income",
y = "Home Price"
) +
theme_minimal()
ggplot(Philly_price_Model, aes(x = houseAge, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Price as a function of house age",
x = "Age",
y = "Home Price"
) +
theme_minimal()
### Home Price Correlation Scatterplot with Distance to Schools
ggplot(Philly_price_Model, aes(x = dist_to_school, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(
title = "Price as a function of distance to school",
x = "Distance to School",
y = "Home Price"
) +
theme_minimal()
### Home Price Correlation with Distance to Parks and Recreation
ggplot(Philly_price_Model, aes(x = dist_to_PPR, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(
title = "Price as a function of distance to Parks and Recreation",
x = "Distance to Parks and Recreation",
y = "Home Price"
) +
theme_minimal()
mean_sale_price <- Philly_price_Model %>%
group_by(GEOID) %>%
summarize(MeanSalePrice = mean(sale_price))
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_sale_price), by = "GEOID") ## attribute join on GEOID, not a spatial join
blockgroup21$MeanSalePrice[is.na(blockgroup21$MeanSalePrice)] <- 0
ggplot() +
geom_sf(data = blockgroup21, aes(fill = MeanSalePrice)) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Sale Price (U.S. Dollars)") +
labs(title = "Choropleth Map of Housing Sale Price by Block Groups") +
theme_minimal()
### Develop a map of the most interesting independent variables: Geographic Mobility
ggplot() +
geom_sf(data = blockgroup21, aes(fill = pctMobility)) +
scale_fill_gradient(low = "yellow", high = "red", name = "Geographic Mobility Rate") +
labs(title = "Choropleth Map of Geographic Mobility by Block Groups") +
theme_minimal()
### Develop a map of the most interesting independent variables: Distance to School
mean_school_dis <- Philly_price_Model %>%
group_by(GEOID) %>%
summarize(MeanSchoolDis = mean(dist_to_school))
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_school_dis), by = "GEOID") ## attribute join on GEOID, not a spatial join
blockgroup21$MeanSchoolDis[is.na(blockgroup21$MeanSchoolDis)] <- 0
ggplot() +
geom_sf(data = blockgroup21, aes(fill = MeanSchoolDis)) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Average Distance") +
labs(title = "Choropleth Map of Distance to school by Block Groups") +
theme_minimal()
ggplot() + geom_sf(data = neighborhood, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(PhillyShootingHead.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Density map of Shootings in head distribution, Philadelphia") +
mapTheme()
## Methods
inTrain <- createDataPartition(
y = paste(Philly_price_Model$quality_grade),
p = .60, list = FALSE)
PhillyPrice.training <- Philly_price_Model[inTrain,]
PhillyPrice.test <- Philly_price_Model[-inTrain,]
reg.training <-
lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
dplyr::select(sale_price, total_area, quality_grade, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))
summary_lm <- summary(reg.training)
# Create a table of the model summary using stargazer
stargazer(reg.training,
title = "Linear Regression Summary",
type = "text",
align = TRUE
)
##
## Linear Regression Summary
## ===============================================
## Dependent variable:
## ---------------------------
## sale_price
## -----------------------------------------------
## total_area 9.133***
## (0.470)
##
## quality_grade2 -368,290.800*
## (191,712.500)
##
## quality_grade3 -6,442.866
## (37,522.490)
##
## quality_grade4 -80,315.700
## (111,092.900)
##
## quality_grade5 -39,056.070
## (191,652.400)
##
## quality_grade6 -165,997.100
## (191,672.400)
##
## quality_gradeA 640,604.500***
## (48,604.800)
##
## quality_gradeA- 471,698.400***
## (41,596.970)
##
## quality_gradeA+ 1,021,304.000***
## (57,538.410)
##
## quality_gradeB 25,680.670*
## (14,028.460)
##
## quality_gradeB- 74,011.110***
## (15,489.140)
##
## quality_gradeB+ 53,687.290***
## (17,195.010)
##
## quality_gradeC -122,750.800***
## (12,517.850)
##
## quality_gradeC- -156,576.300***
## (15,481.280)
##
## quality_gradeC+ -61,474.440***
## (12,910.320)
##
## quality_gradeD -164,485.900***
## (21,958.710)
##
## quality_gradeD- -171,151.400**
## (73,355.040)
##
## quality_gradeD+ -212,778.800***
## (41,775.330)
##
## quality_gradeE -190,203.500***
## (50,946.530)
##
## quality_gradeE- -21,892.160
## (111,162.000)
##
## quality_gradeE+ -69,774.360
## (79,078.330)
##
## quality_gradeS 20,349.300
## (135,804.900)
##
## quality_gradeS+ 151,962.500
## (191,725.000)
##
## quality_gradeX 1,643,484.000***
## (142,636.200)
##
## quality_gradeX- 901,192.900***
## (192,048.700)
##
## houseAge -276.275***
## (60.795)
##
## shooting.Buffer -7,416.102***
## (670.568)
##
## dist_to_school -1.735
## (2.746)
##
## dist_to_commerce -22.242***
## (2.766)
##
## dist_to_PPR -2.303
## (1.719)
##
## MedHHInc 2.708***
## (0.054)
##
## pctForeign 59,754.540***
## (13,165.530)
##
## pctMobility 257,478.900***
## (15,698.030)
##
## Constant 203,314.300***
## (14,027.520)
##
## -----------------------------------------------
## Observations 14,278
## R2 0.390
## Adjusted R2 0.389
## Residual Std. Error 191,260.100 (df = 14244)
## F Statistic 276.443*** (df = 33; 14244)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(Philly_price_Model) %>%
dplyr::select(sale_price,
total_area, quality_grade,
houseAge, shooting.Buffer, dist_to_school,
dist_to_commerce, dist_to_PPR, MedHHInc,
pctForeign, pctMobility),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 23781 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 23542, 23542, 23544, 23544, 23544, 23542, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 180836.7 0.4208634 107655.1
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
mean(reg.cv$resample[,3])
## [1] 107655.1
sd(reg.cv$resample[,3])
## [1] 9851.679
cv_results <- data.frame(Resample = reg.cv$resample$Resample, ## fold labels, one per resample
MAE = reg.cv$resample$MAE)
# Create a histogram of MAE values
ggplot(cv_results, aes(x = MAE)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
labs(title = "Cross-Validation Mean Absolute Error (MAE) Histogram",
x = "Mean Absolute Error (MAE)",
y = "Frequency")
PhillyPrice.test <-
PhillyPrice.test %>%
mutate(SalePrice.Predict = predict(reg.training, PhillyPrice.test),
SalePrice.Error = SalePrice.Predict - sale_price,
SalePrice.AbsError = abs(SalePrice.Predict - sale_price),
SalePrice.APE = abs(SalePrice.Predict - sale_price) / sale_price) ## absolute percent error relative to the observed price
scatter_plot <- ggplot(data = PhillyPrice.test, aes(x = SalePrice.Predict, y = sale_price)) +
geom_point() + # Add scatter points
geom_abline(intercept = 0, slope = 1, color = "orange", linetype = "dashed") + # Perfect fit line
geom_smooth(method = "lm", se = FALSE, color = "green", linetype = "dashed") + # Average (regression) fit of observed on predicted
labs(title = "Scatter Plot of SalePrice vs. SalePrice.Predict",
x = "SalePrice.Predict",
y = "SalePrice")
# Print the scatter plot
print(scatter_plot)
### Provide a map of your residuals for your test set.
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
ggplot() +
geom_sf(data = blockgroup21, fill = "grey40") +
geom_sf(data = PhillyPrice.test, aes(colour = q5(SalePrice.Error)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
name="Quintile\nBreaks") +
labs(title="Map of Residuals of test set") +
mapTheme()
### Moran’s I test
coords <- st_coordinates(Philly_price_Model)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
Philly_price_Model$lagPrice <- lag.listw(spatialWeights, Philly_price_Model$sale_price)
coords.test <- st_coordinates(PhillyPrice.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
PhillyPrice.test <- PhillyPrice.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error))
moranTest <- moran.mc(PhillyPrice.test$SalePrice.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
### Scatter plot of error as a function of the spatial lag of price
ggplot(PhillyPrice.test, aes(x = lagPriceError, y = SalePrice.Error)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "orange")+
labs(title = "Error as a function of the spatial lag of price",
x = "lagPriceError",
y = "SalePrice.Error")
### Provide a map of your predicted values where 'toPredict' is both "MODELLING" and "CHALLENGE".
Philly_Housing_joined <-
Philly_Housing_joined %>%
mutate(SalePrice.Predict = predict(reg.training, Philly_Housing_joined))
ggplot() +
geom_sf(data = neighborhood, fill = "grey40") +
geom_sf(data = Philly_Housing_joined, aes(colour = q5(SalePrice.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
name="Quintile\nBreaks") +
labs(title="Map of predicted values") +
mapTheme()